Fast genome and metagenome distance estimation using MinHash
Authors
Abstract
Given a massive collection of sequences, it is infeasible to perform pairwise alignment for basic tasks like sequence clustering and search. To address this problem, we demonstrate that the MinHash technique, first applied to clustering web pages, can be applied to biological sequences with similar effect, and extend this idea to include biologically relevant distance and significance measures. Our new tool, Mash, uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all ~55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time. Other potential applications include any problem where an approximate, global sequence distance is acceptable, e.g. to triage and cluster sequence data, assign species labels to unknown genomes, quickly identify mistracked samples, and search massive genomic databases. In addition, the Mash distance metric is based on simple set intersections, which are compatible with homomorphic encryption schemes. To facilitate integration with other software, Mash is implemented as a lightweight C++ toolkit and freely released under a BSD license at https://github.com/marbl/mash. 
Introduction
When BLAST was first published in 1990, there were fewer than 50 million bases of nucleotide sequence in the public archives (http://www.ncbi.nlm.nih.gov/genbank/statistics); now a single sequencing instrument can produce over 1 trillion bases per run. New methods are needed that can manage and help organize this scale of data. To address this, we consider the general problem of computing an approximate distance between two sequences and describe Mash, a general-purpose toolkit that utilizes the MinHash technique to reduce large sequences (or sequence sets) to compressed sketch representations. Using only the sketches, which can be thousands of fold smaller, the similarity of the original sequences can be rapidly estimated with ...
bioRxiv preprint first posted online Oct. 26, 2015; doi: http://dx.doi.org/10.1101/029827. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
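To make the sketching idea concrete, the following is a minimal Python sketch of bottom-s MinHash over k-mers, not Mash's actual implementation. The function names, the truncated SHA-1 hash, and the default k and s are all illustrative assumptions. The distance formula in mash_distance is the one commonly given for the Mash distance, -(1/k) ln(2j / (1 + j)); it is not stated in this excerpt and should be read as an assumption here. Reverse-complement (canonical) k-mers are omitted for brevity.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (truncated SHA-1; an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k=5, s=50):
    """Bottom-s sketch: the s smallest distinct hash values of the k-mer set."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s=50):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count the
    fraction that appear in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a distance in [0, 1] (assumed formula)."""
    if j <= 0.0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2.0 * j / (1.0 + j)) / k))


if __name__ == "__main__":
    a = "ACGTACGTACGGATCCGATTACAGGCATTAGC"
    b = "ACGTACGTACGGATCCGATTACAGGCATAAGC"  # one substitution relative to a
    j = jaccard_estimate(sketch(a), sketch(b))
    print(f"Jaccard estimate: {j:.3f}, distance: {mash_distance(j, 5):.3f}")
```

Only the sketches are compared, so the cost of each comparison depends on the sketch size s rather than on the sequence length. When s is at least the number of distinct k-mers, the estimate equals the exact Jaccard index of the two k-mer sets, barring hash collisions.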
Related resources
Kmerlight: fast and accurate k-mer abundance estimation
k-mers (nucleotide strings of length k) form the basis of several algorithms in computational genomics. In particular, k-mer abundance information in sequence data is useful in read error correction, parameter estimation for genome assembly, digital normalization etc. We give a streaming algorithm Kmerlight for computing the k-mer abundance histogram from sequence data. Our algorithm is fast an...
SuperMinHash - A New Minwise Hashing Algorithm for Jaccard Similarity Estimation
This paper presents a new algorithm for calculating hash signatures of sets which can be directly used for Jaccard similarity estimation. The new approach is an improvement over the MinHash algorithm, because it has a better runtime behavior and the resulting signatures allow a more precise estimation of the Jaccard index.
MinHash Sketches: A Brief Survey
Sketches are a very powerful tool in massive data analysis. Operations and queries that are specified with respect to the explicit and often very large subsets, can be processed instead in sketch space – that is, quickly (but approximately) from the much smaller sketches. MinHash sketches (Min-wise sketches) are randomized summary structures of subsets (or equivalently 0/1 vectors). The sketche...
In Defense of Minhash over Simhash
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHa...
MetaScope - Fast and accurate identification of microbes in metagenomic sequencing data
MetaScope is a fast and accurate tool for analyzing (host-associated) metagenome datasets. Sequence alignment of reads against the host genome (if requested) and against microbial Genbank is performed using a new DNA aligner called SASS. The output of SASS is processed so as to assign all microbial reads to taxa and genes, using a new weighted version of the LCA algorithm. MetaScope is the winn...